Abstract — Vehicle tracking using airborne wide-area motion
imagery (WAMI) for monitoring urban environments is very
challenging for current state-of-the-art tracking algorithms,
compared to object tracking in full motion video (FMV).
Characteristics that constrain performance in WAMI to relatively
short tracks range from the limitations of the camera sensor array,
including low frame rate and georegistration inaccuracies, to
small target support size, presence of numerous shadows and
occlusions from buildings, continuously changing vantage point
of the platform, and presence of distractors and clutter, among
other confounding factors. We describe our Likelihood of Features
Tracking (LoFT) system that is based on fusing multiple sources
of information about the target and its environment akin to
a track-before-detect approach. LoFT uses image-based feature
likelihood maps derived from a template-based target model,
object and motion saliency, track prediction and management,
combined with a novel adaptive appearance target update model.
Quantitative measures of performance are presented using a set
of manually marked objects in both WAMI, namely Columbus
Large Image Format (CLIF), and several standard FMV
sequences. Comparison with a number of single object tracking
systems shows that LoFT outperforms other visual trackers,
including state-of-the-art sparse representation and learning
based methods, by a significant amount on the CLIF sequences
and is competitive on FMV sequences.

I.  Introduction 

Target tracking remains a challenging problem in computer
vision [1] due to target-environment appearance variabilities,
significant illumination changes and partial occlusions.
Tracking in aerial imagery is generally harder than traditional
tracking due to the problems associated with a moving
platform, including gimbal-based stabilization errors, relative
motion where sensor and target are both moving, seams in
mosaics where stitching is inaccurate, georegistration errors,
and drift in the intrinsic and extrinsic camera parameters at
high altitudes [2]. Tracking in Wide-Area Motion Imagery
(WAMI) poses a number of additional difficulties for
vision-based tracking algorithms due to very large gigapixel sized
images, low frame rate sampling, low resolution targets, limited
target contrast, foreground distractors, background clutter,
shadows, static and dynamic parallax occlusions, platform
motion, registration, mosaicking across multiple cameras, object
dynamics, etc. [3], [4], [5], [6], [7], [8], [9], [10], [11],
[12], [13]. These difficulties make the tracking task in WAMI
more challenging compared to standard ground-based or even
narrow field-of-view (aerial) full motion video (FMV).

Traditional visual trackers either use motion/change detection
or template matching. Persistent tracking using motion
detection-based schemes needs to accommodate dynamic
behaviors where initially moving objects can become stationary
for short or extended time periods, then start to move
again. Motion-based methods face difficulties with registration,
scenes with a dense set of objects, or near-stationary targets.
Accuracy of background subtraction and track association
dictates the success of these tracking methods [10], [9], [14],
[15]. Template trackers, on the other hand, can drift off target
and attach themselves to objects that seem similar, without an
update to the appearance model [16], [2].

Visual tracking is an active research area with a recent
focus on appearance adaptation, learning and sparse
representation. Appearance models are used in [17], [18], [19],
[20], classification and learning techniques have been studied
in [21], [22], and parts-based deformable templates in [23].
Gu et al. [20] stress low computation cost in addition to
robustness and propose a simple yet powerful Nearest Neighbor
(NN) method for real-time tracking. Online multiple instance
learning (MILTrack) is used to achieve robustness to image
distortions and occlusions [21]. The P-N tracker [22] uses
bootstrapping binary classifiers and shows higher reliability
by generating longer tracks. Mei et al. [24], [25] propose a
robust tracking method using a sparse representation approach
within a particle filter framework to account for pose changes.

We have developed the Likelihood of Features Tracking
(LoFT) system to track objects in WAMI. The overall LoFT
tracking system, shown in Figure 1, can be broadly organized
into several categories including: (i) Target modeling, (ii)
Likelihood fusion, and (iii) Track management. Given a target
of interest, it is modeled using a rich feature set including
intensity/color, edge, shape and texture information [4], [26].
The novelty of the overall LoFT system stems from a combination
of critical components including a flexible set of features
to model the target, an explicit appearance update scheme,
adaptive posterior likelihood fusion for track-before-detect,
a kinematic motion model, and track termination working
cooperatively in balance to produce a reliable tracking system.

Fig. 1. Likelihood of Features Tracking (LoFT) processing pipeline showing major components including feature extraction, feature likelihood map estimation
by combining with the template, vehicle detection using support vector machine (SVM) classification, fusion module that also incorporates prediction based
motion and background subtraction based motion, to produce a fused likelihood for target localization after track extension. The track management includes
termination module, prediction with or without multiple hypothesis tracking (MHT) and object appearance updating for adaptive target modeling.


The rest of the paper is organized as follows. Section II
describes estimating features and fusing the posterior
likelihood maps. Section III describes the novel target appearance
modeling and adaptive update modules. Section IV describes
the tracker management component that is often lacking in
other systems, including smooth trajectory assessment and
appropriate tracker termination to maintain track purity.
Experimental results for both CLIF WAMI and FMV video
sequences are described in Section V, followed by conclusions.

II.  Likelihood  Fusion 

The target area is modeled using features that can be
grouped into categories such as block, edge, shape and texture
based. We use block-based intensity features, gradient
magnitude for edge-based features, gradient orientation information
using Histogram of Oriented Gradients, eigenvalues of the
Hessian matrix for shape information, and local binary patterns
for texture. See [4] for more details on these feature descriptors
and their integral histogram-based implementation.

LoFT uses a recognition-based target localization approach
using the maximum likelihood of the target being within the
search region conditioned on a feature. A likelihood map is
estimated for each feature by comparing feature histograms
of the target within the search region using a sliding
window-based approach (see Fig. 1). Each pixel in the likelihood map
indicates the posterior probability of that pixel belonging to
the target. Fusing features enables adaptation of the tracker
to dynamic environment changes and target appearance
variabilities. Using a track-before-detect approach provides more
robust localization, especially for cluttered dense environments
[27]. Feature adaptation uses a weighted sum Bayes fusion
rule that tends to perform better than other methods such as
the product rule [28]. The critical aspect in weighted sum
fusion is the relative importance of feature maps. Each feature
performs differently depending on the target characteristic and
environmental situations during tracking. Equally weighted
fusion of likelihood maps can decrease performance when
some of the features are not informative in that environment.


The  importance  assigned  to  each  feature  can  be  adapted  to 
the  changes  in  target  pose  and  the  surrounding  background. 
Temporal  feature  weight  adaptation  can  improve  performance 
under  changes  that  are  not  explicitly  modeled  by  the  tracker. 

We considered two weighting schemes, the Variance
Ratio (VR) [19] and the Distractor Index [4]. LoFT
fuses the histogram-based and correlation-based features in
two stages. First, histogram-based features are fused using the
VR method [19], [29], which adaptively weights the features
according to the discriminative power between the target and
the background, measured using the two-class ratio of total
to within-class variances. Second, non-histogram (i.e.,
correlation) based features are combined with the fused
histogram-based features using the Distractor Index method proposed by
Palaniappan et al. [4]. In this method, the number of local
maxima within 90% of the peak likelihood and within the
spatial support of the object template, $N_T$, is used as the
number of viable peaks for the $i$th feature, $m_i \in [1, \infty)$.
Fusion feature weights in LoFT are then calculated using [4],

$$ w_i \approx m_i^{-1} \times \left( \sum_{j=1}^{n} 1/m_j \right)^{-1}. \qquad (1) $$

Consequently, high distractor index values will result in low
weights for unreliable features. By assuming the environment
does not change drastically across frames, the system fuses the
likelihood maps of frame $k$ using the feature weights which
were estimated at frame $k - 1$. Calculating feature weights
dynamically enables the tracker to cope with small appearance
changes in target and environment. Strong local maxima in
the fused map that exceed a predetermined threshold are
considered as potential target locations.
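
As a minimal sketch of Eq. 1 and the weighted-sum rule, the snippet below normalizes inverse viable-peak counts into weights and fuses per-feature likelihood maps. The function names (`distractor_weights`, `fuse_maps`) and the toy 2x2 maps are hypothetical and chosen only for illustration; counting the viable peaks $m_i$ is assumed to have been done beforehand.

```python
def distractor_weights(peak_counts):
    """Weights proportional to 1/m_i, normalized to sum to 1 (Eq. 1)."""
    if any(m < 1 for m in peak_counts):
        raise ValueError("each viable-peak count m_i must be >= 1")
    inverse = [1.0 / m for m in peak_counts]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuse_maps(likelihood_maps, weights):
    """Weighted-sum fusion of equally sized 2-D likelihood maps (nested lists)."""
    rows, cols = len(likelihood_maps[0]), len(likelihood_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for lmap, w in zip(likelihood_maps, weights):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += w * lmap[r][c]
    return fused


# Toy example: feature 0 has one clear peak, feature 1 has four viable peaks.
weights = distractor_weights([1, 4])      # weights estimated at frame k-1
maps_k = [[[0.1, 0.9], [0.2, 0.1]],       # feature 0 likelihood map at frame k
          [[0.8, 0.7], [0.9, 0.8]]]       # feature 1 likelihood map at frame k
fused_k = fuse_maps(maps_k, weights)
```

In this toy case the ambiguous second feature receives a quarter of the weight of the first, so the fused maximum follows the single clear peak of feature 0.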

III.  Target  Modeling 

LoFT [4] uses the principle of single target template-based
tracking where target features are used to match an area or
region in subsequent frames. Static template-based tracking
has been studied in computer vision since the 1970s.
Currently, generative models such as [17], [18] or discriminative
models such as [30], [21] all have online and offline versions
to robustly adapt to object appearance variability. Recently,
several trackers based on sparse representation have shown
promise in handling complex appearance changes [25], [31],
[32]. Our dynamic appearance adaptation scheme maintains
and updates a single template by estimating affine changes in
the target to handle orientation and scale changes [33], using
multiscale Laplacian of Gaussian edge detection followed
by segmentation to largely correct for drift. Multi-template
extensions of the proposed approach are straightforward but
computationally more expensive. LoFT is being extended in
this direction by parallelization of the integral histogram [34].


A.  Appearance  Update 

Given a target object template, $T_s$, in the initial starting
image frame, $I_s$, we want to identify the location of the
template in each frame of the WAMI sequence using a
likelihood matching function, $M(\cdot)$. Once the presence or absence
of the target has been determined, we then need to decide
whether or not to update the template. The target template
needs to be updated at appropriate time points during tracking,
without drifting off the target, using an update schedule which
is a tradeoff between plasticity (fast template updates) and
stability (slow template updates). The template search and
update model can be represented as,

$$ x^*_{k+1} = \arg\max_{x \in N_W} M\big( I_{k+1}(x + c),\, T_u \big), \quad c \in N_T, \quad k \ge s, \quad u \ge s $$

$$ T_{k+1} = \begin{cases} I_{k+1}(x^*_{k+1} + c),\ c \in N_T, & \text{if } f(x^*_{k+1}, I_{k+1}, T_u) > Th \\ T_u, & \text{otherwise} \end{cases} \qquad (2) $$


where $M(\cdot)$ denotes the posterior likelihood estimation
operator that compares the vehicle/car template from time step $u$,
$T_u$ (with support region or image chip, $c \in N_T$), within the
image search window region, $N_W$, at time step $k + 1$. The
optimal target location in $I_{k+1}$ is given by $x^*_{k+1}$. If the car
appearance is stable with respect to the last updated template,
$T_u$, then no template update for $T_{k+1}$ is performed. However,
if the appearance change function, $f(\cdot)$, is above a threshold,
indicating that the object appearance is changing, and we are
confident that this change is not due to an occlusion or shadow,
then the template is updated to the image block centered at
$x^*_{k+1}$. Instead of maintaining and updating a single template
model of the target, a collection of templates can be kept (as in
learning-based methods) using the same framework, in which
case we would search for the best match among all templates
in Eq. 2. Note that if $u = s$ then the object template is never
updated and remains identical to the initialized target model;
$u = k$ naively updates on every frame. Our adaptive update
function $f(\cdot)$ considers a variety of factors such as orientation,
illumination, scale change and update method.
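
The search-then-update logic of Eq. 2 can be sketched as follows. This is only an illustration under stated assumptions: `match_score` is a stand-in for $M(\cdot)$ (negative mean squared difference rather than LoFT's fused feature likelihood), `change_fn` is a placeholder for $f(\cdot)$, positions are top-left corners of the chip rather than its center, and all names are hypothetical.

```python
def match_score(chip, template):
    """Stand-in for M(.): negative mean squared difference, higher is better."""
    diffs = [(a - b) ** 2 for row_c, row_t in zip(chip, template)
             for a, b in zip(row_c, row_t)]
    return -sum(diffs) / len(diffs)


def crop(image, top, left, height, width):
    """Cut a height x width chip out of a nested-list image."""
    return [row[left:left + width] for row in image[top:top + height]]


def search_and_update(image, template, window, change_fn, threshold):
    """Return (best top-left position, template to use for the next frame).

    window = (top, left, bottom, right) bounds the candidate top-left corners
    and plays the role of the search region N_W.
    """
    h, w = len(template), len(template[0])
    best_pos, best_score = None, float("-inf")
    for top in range(window[0], window[2] + 1):
        for left in range(window[1], window[3] + 1):
            if top < 0 or left < 0 or top + h > len(image) or left + w > len(image[0]):
                continue  # chip would fall outside the image
            score = match_score(crop(image, top, left, h, w), template)
            if score > best_score:
                best_pos, best_score = (top, left), score
    if best_pos is None:
        return None, template
    chip = crop(image, best_pos[0], best_pos[1], h, w)
    # Replace the template only when the appearance-change score is high enough.
    if change_fn(chip, template) > threshold:
        return best_pos, chip
    return best_pos, template


# Toy frame: a 2x2 target placed at row 5, column 7 of an empty 12x12 image.
img = [[0.0] * 12 for _ in range(12)]
tmpl = [[1.0, 2.0], [3.0, 4.0]]
for r in range(2):
    for c in range(2):
        img[5 + r][7 + c] = tmpl[r][c]
pos, next_template = search_and_update(img, tmpl, (3, 5, 8, 10),
                                       change_fn=lambda chip, t: 0.0, threshold=0.5)
```

With a `change_fn` that always returns zero the template is never replaced (the $u = s$ case); one that always exceeds the threshold reproduces the naive every-frame update ($u = k$).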

In most video object tracking scenarios the no update
scheme rarely leads to better performance [17], whereas naively
updating on every frame will quickly cause the tracker to drift,
especially in complex video such as WAMI [4]; the
tradeoff between these two extremes is commonly referred
to as the stability-plasticity dilemma [35]. Figure 2 shows
several frames of a sample car from the CLIF sequences as its
appearance changes over time. Our approach to this dilemma
is to explicitly model appearance variation by estimating
scale and orientation changes in the target in a manner that is
robust to illumination variation. Segmentation can further improve
performance [36], [37].

Fig. 2. Orientation and intensity appearance changes of the same vehicle
over a short period of time necessitate updates to the target template at an
appropriate schedule balancing plasticity and stability.

We  recover  the  affine  transformation  matrix  to  model  the 
appearance  update  by  first  extracting  a  reliable  contour  of  the 
object  to  be  tracked  using  a  multiscale  Laplacian  of  Gaussian, 
followed  by  estimating  the  updated  pose  of  the  object  using 
the  Radon  transform  projections  as  described  below. 

B.  Laplacian  of  Gaussian 

We use a multi-scale Laplacian of Gaussian (LoG) filter to
increase the response of the edge pixels. Using a series of
convolutions with scale-normalized LoG kernels $\sigma^2 \nabla^2 G(x, y, \sigma^2)$,
where $\sigma$ denotes the standard deviation of the Gaussian filter,

$$ I_{k,L}(x, y, \sigma^2) = I_k(x, y) * \sigma^2 \nabla^2 G(x, y, \sigma^2) \qquad (3) $$

we estimate the object scale at time $k$ by estimating the mean
of the local maxima responses in the LoG within the vehicle
template region $N_T$. If this $\sigma^*_k$ has changed from $\sigma^*$ then the
object scale is updated.
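
For reference, assuming $G$ is the standard isotropic 2-D Gaussian with variance $\sigma^2$ (the text does not spell out the kernel), the scale-normalized LoG kernel used in Eq. 3 has the closed form:

```latex
G(x, y, \sigma^2) = \frac{1}{2\pi\sigma^2}
    \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right),
\qquad
\sigma^2 \nabla^2 G(x, y, \sigma^2)
  = \frac{x^2 + y^2 - 2\sigma^2}{2\pi\sigma^4}
    \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)
```

The $\sigma^2$ factor compensates for the decay of the raw Laplacian response with increasing $\sigma$, which is what makes responses comparable across the series of kernels.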

C.  Orientation  estimation 

The Radon transform is used to estimate the orientation of
the object [33]. Applying the transform to the LoG image
$I_{k,L}(x, y)$, we can denote the line integrals as:

$$ R_k(\rho, \theta) = \iint I_{k,L}(x, y)\, \delta(\rho - x \cos\theta - y \sin\theta)\, dx\, dy \qquad (4) $$

where $\delta(\cdot)$ is the Dirac delta function that samples the image
along a ray $(\rho, \theta)$. Given the image projection at angle $\theta$, we
estimate the variance of each projection profile and search for
the maximum in the projection variances by using a
second-order derivative operator to achieve robustness to illumination
change [38]. An example of vehicle orientation and change
in orientation estimation is shown in Figure 3. This
appearance update procedure seems to provide a balance between
plasticity and stability that works well for vehicles in aerial
imagery. More detailed performance evaluation of orientation
estimation is found in our related work [39].

Fig. 3. Vehicle orientations are measured with respect to the vertical axis pointed up. (a)
Car template. (b) Variance of Radon transform profiles with maximum at 90°
(red square). (c) Car template rotated by 45° CCW. (d) Peak in variance of Radon
transform profiles at 135° (red square), for a correct change in orientation of 45°.
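
The projection-variance idea behind Eq. 4 can be illustrated with a discrete sketch. It is a simplification under stated assumptions: the input is a list of weighted edge points rather than a full LoG image, the maximum is found by a plain search over sampled angles instead of the second-order derivative operator of [38], and $\theta$ is the projection direction of Eq. 4 (measured from the x axis), not the vertical-axis convention of Figure 3. Function names are hypothetical.

```python
import math


def projection_variance(points, theta_deg, radius):
    """Variance of the projection profile of weighted points at angle theta.

    points: iterable of (x, y, value) with coordinates centered on the object.
    Each point is accumulated at rho = x*cos(theta) + y*sin(theta), as in Eq. 4.
    """
    theta = math.radians(theta_deg)
    profile = [0.0] * (2 * radius + 1)
    for x, y, value in points:
        rho = x * math.cos(theta) + y * math.sin(theta)
        profile[int(math.floor(rho + 0.5)) + radius] += value
    mean = sum(profile) / len(profile)
    return sum((p - mean) ** 2 for p in profile) / len(profile)


def dominant_angle(points, radius, step_deg=5):
    """Sampled angle in [0, 180) whose projection profile has maximum variance."""
    angles = range(0, 180, step_deg)
    return max(angles, key=lambda a: projection_variance(points, a, radius))


# Toy edge map: a horizontal bar, 21 pixels long and 3 pixels thick.
bar = [(x, y, 1.0) for x in range(-10, 11) for y in range(-1, 2)]
angle = dominant_angle(bar, radius=12)
```

The profile is most peaked, and hence has the largest variance, when the projection direction is perpendicular to the long axis of the object, so the horizontal bar yields 90 degrees under this convention.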

IV.  Track  Management 

A robust tracker should maintain track history information
and terminate the tracker as performance deteriorates
irrecoverably (e.g. camera seam boundary), the target leaves
the field-of-view (e.g. target exiting the scene), enters a long
occluded/shadow region, or the tracker has lost the target.
LoFT incorporates multiple track termination conditions to
ensure high precision (track purity) and enable downstream
tracklet stitching algorithms to operate efficiently during track
stitching. Track linearity or smoothness guides the tracker to
select more plausible target locations incorporating vehicle
motion dynamics and a module for terminating the tracker.

A.  Smooth  Trajectory  Dynamics  Assumption 

Peaks in the fused likelihood map are often numerous due to
clutter and denote possible target locations including
distractors. However, only a small subset of these will satisfy the
smooth motion assumption (i.e., linear motion). Checks for
smooth motion/linearity are enforced before a candidate target
location is selected to eliminate improbable locations. Figure 4
illustrates the linear motion constraint. The red point indicates
a candidate object with a very similar appearance to the target
being tracked, but this location is improbable since it does not
satisfy the trajectory motion dynamics check and so the next
highest peak is selected (yellow dot). This condition enforces
smoothness of the trajectory, thus eliminating erratic jumps, and
does not affect turning cars.
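
The exact form of the smoothness check is not given here, so the following is only one plausible sketch: candidates are visited in order of decreasing likelihood and the first one within a distance gate of the constant-velocity extrapolation is accepted. The gate radius `max_deviation` and the function name are assumptions made for illustration.

```python
import math


def select_candidate(track, peaks, max_deviation):
    """Pick the strongest peak consistent with constant-velocity motion.

    track: list of past (x, y) positions, most recent last (needs >= 2 points).
    peaks: list of ((x, y), likelihood) candidates from the fused map.
    max_deviation: gate radius in pixels around the extrapolated position.
    Returns the chosen (x, y), or None if no peak passes the check.
    """
    (x1, y1), (x2, y2) = track[-2], track[-1]
    expected = (2 * x2 - x1, 2 * y2 - y1)   # linear extrapolation of the last step
    for pos, _ in sorted(peaks, key=lambda p: p[1], reverse=True):
        if math.dist(pos, expected) <= max_deviation:
            return pos
    return None


# The strongest peak is a distractor off the trajectory; the weaker peak is kept.
history = [(0, 0), (10, 0)]
candidates = [((12, 25), 0.95), ((21, 1), 0.80)]
chosen = select_candidate(history, candidates, max_deviation=8)
```

A gate of this kind rejects isolated jumps while still admitting the gradual heading changes of a turning vehicle, provided the gate radius is larger than the per-frame deviation of the turn.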


Fig.  4.  When  the  maximum  peak  (red  dot)  deviates  from  the  smooth 
trajectory  assumption  (in  this  case  linearity)  LoFT  ignores  the  distractor  to 
select  a  less  dominant  peak  satisfying  the  linearity  constraint  (yellow  dot). 

B.  Prediction  &  Filtering  Dynamical  Model 

LoFT can use multiple types of filters for motion prediction.
In the implementation evaluated for this paper we used a
Kalman filter for smoothing and prediction [40], [41] to
determine the search window in the next frame, $I_{k+1}$. The
Kalman filter is a recursive filter that estimates the state,
$\mathbf{x}_k$, of a linear dynamical system from a series of noisy
measurements, $\mathbf{z}_k$. At each time step $k$ the state transition
model is applied to the state to generate the new state,

$$ \mathbf{x}_{k+1} = \mathbf{F}_k \mathbf{x}_k + \mathbf{v}_k \qquad (5) $$

assuming a linear additive Gaussian process noise model.
The measurement equation under uncertainty generates the
observed outputs from the true ("hidden") state,

$$ \mathbf{z}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{w}_k \qquad (6) $$

where $\mathbf{v}_k$ denotes process noise (Gaussian with zero mean
and covariance $\mathbf{Q}_k$), and $\mathbf{w}_k$ denotes measurement noise (Gaussian
with zero mean and covariance $\mathbf{R}_k$). The system plant is
modeled by known linear systems, where $\mathbf{F}_k$ is the
state-transition matrix and $\mathbf{H}_k$ is the observation model.
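
To make Eqs. 5 and 6 concrete, the sketch below runs a Kalman filter for a single image axis. The constant-velocity state (position, velocity), the noise values `q` and `r`, and the initial covariance are assumptions for illustration only; the state vector and parameters actually used in LoFT are not specified in this section.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-3, r=1.0):
    """One predict+update cycle of a 1-D constant-velocity Kalman filter.

    state = (position, velocity); cov = 2x2 covariance as nested lists;
    z = measured position. F = [[1, dt], [0, 1]] and H = [1, 0].
    """
    p, v = state
    (c00, c01), (c10, c11) = cov
    # Predict: x = F x,  P = F P F^T + Q  (Q = q * I is assumed here).
    p_pred = p + dt * v
    a00 = c00 + dt * (c01 + c10) + dt * dt * c11 + q
    a01 = c01 + dt * c11
    a10 = c10 + dt * c11
    a11 = c11 + q
    # Update with the position measurement z = H x + w.
    s = a00 + r                     # innovation covariance
    k0, k1 = a00 / s, a10 / s       # Kalman gain
    resid = z - p_pred
    new_state = (p_pred + k0 * resid, v + k1 * resid)
    new_cov = [[(1 - k0) * a00, (1 - k0) * a01],
               [a10 - k1 * a00, a11 - k1 * a01]]
    return new_state, new_cov


state, cov = (0.0, 0.0), [[100.0, 0.0], [0.0, 100.0]]
for k in range(30):
    state, cov = kalman_step(state, cov, z=2.0 * k)   # target moves 2 px per frame
predicted_next = state[0] + state[1]                  # search-window center for k+1
```

Running one such filter per image axis gives a predicted position that can center the search window, and that can be reported directly when the fused likelihood is uninformative.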

Possible target locations within the search window are
denoted by peak locations in the fused posterior vehicle
likelihood map. Candidate locations are then filtered by
incorporating the prediction information. Given a case where
feature fusion indicates low probability of the target location
(due to occlusions, image distortions, inadequacy of features to
localize the object, etc.) the filtering-based predicted position
is then reported as the target location. Figure 5 shows LoFT
with the appearance-based update module being active over
the track segments in yellow with informative search windows,
whereas in the shadow region the appearance-based features
become unreliable and LoFT switches to using only
filtering-based prediction mode (track segments in white).


Fig. 5. Adaptation to changing environmental situations. LoFT switches
between fused feature- and filtering-based target localization
within informative search windows (yellow boxes) and predominantly
filtering-based localization in uninformative search windows (white boxes).

C. Target vs. Environment Contrast

LoFT measures the dissimilarity between the target and its
surrounding environment in order to assess the presence of
occlusion events. If the VR between the target and its
environment is below a threshold, this indicates a high probability
that the tracker/target is within an occluded region. In such
situations, LoFT relies more heavily on the dynamical filter
predictions. Figure 6 shows a sample frame which illustrates
the difference between high and low VR locations.
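
The VR itself is defined in [19] rather than here. The sketch below assumes one common log-likelihood-ratio formulation, in which the variance of the log ratio over both classes pooled is divided by the sum of the within-class variances; the smoothing constant `delta`, the guard term, and the toy histograms are illustrative choices.

```python
import math


def variance_ratio(fg_hist, bg_hist, delta=1e-3):
    """Variance ratio between normalized foreground/background histograms.

    L(i) = log(max(p_i, delta) / max(q_i, delta)); the ratio compares the
    variance of L over both classes pooled with the sum of within-class
    variances. Larger values mean the target stands out from its surroundings.
    """
    log_ratio = [math.log(max(p, delta) / max(q, delta))
                 for p, q in zip(fg_hist, bg_hist)]

    def weighted_var(weights):
        mean = sum(w * l for w, l in zip(weights, log_ratio))
        mean_sq = sum(w * l * l for w, l in zip(weights, log_ratio))
        return mean_sq - mean * mean

    pooled = [(p + q) / 2.0 for p, q in zip(fg_hist, bg_hist)]
    within = weighted_var(fg_hist) + weighted_var(bg_hist)
    return weighted_var(pooled) / (within + 1e-12)   # guard against 0/0


vr_distinct = variance_ratio([0.9, 0.1], [0.1, 0.9])    # target differs from background
vr_similar = variance_ratio([0.5, 0.5], [0.45, 0.55])   # e.g. target inside a shadow
```

Comparing the returned value against a threshold then gives the occlusion test described above: a low ratio suggests that the foreground and background distributions have become similar.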

D.  Image/Camera  Boundary  Check 

LoFT determines if the target is leaving the scene, crossing
a seam or entering an image boundary region on every
iteration in order to test for the disappearance of targets. If the
predicted location is out of the working boundary, the tracker
automatically terminates to avoid data access issues (Figure 7).

Fig. 6. Pixels within the red rectangle form the foreground (Fg) distribution,
pixels between the red and blue rectangles form the background (Bg)
distribution. Left: High VR when Fg and Bg regions have different distributions.
Right: Low VR when Fg and Bg regions have similar distributions.

Fig. 7. Termination of tracks for targets leaving the working image boundary.


V. Experimental Results

A. Datasets Used

LoFT was evaluated using the Columbus Large Image
Format (CLIF) [42] WAMI dataset which has a number of
challenging conditions such as shadows, occlusions, turning
vehicles, low contrast and fast vehicle motion. We used the
same vehicles selected in [11] which have a total of 455
ground-truth locations of which more than 22% are occluded
locations. The short track lengths combined with a high degree
of occlusions make the tracking task especially challenging.
Several examples of the difficulties in vehicle tracking in
CLIF are illustrated in Figure 8. Figure 9 shows that half
the sequences in this sample set of tracks have a significant
amount of occluded regions and Table I summarizes the
challenges in each sequence. We used several FMV sequences
which have been used to benchmark a number of published
tracking algorithms in the literature. These sequences include
'girl', 'david', 'faceocc' and 'faceocc2' [20], and allow comparison
of LoFT against a number of existing tracker results for which
source code may not be available.

Fig. 8. Example of challenging conditions: Target appearance changes during
turning (C4_1_0), low contrast and shadows (C3_3_4), shadow occlusion
(C0_3_0) and combined building and shadow occlusion (C2_4_1) [49].

B. Registration and Ground-Truth for CLIF WAMI

In our tests we used the same homographies as in [11] that
were estimated using SIFT (Scale Invariant Feature Transform)
[43] with RANSAC to map each frame in a sequence to
the first base frame. Several other approaches have been
used to register CLIF imagery, including Lucas-Kanade and
correlation-based methods [44], or can be adapted for WAMI [45],
[46]. Using these homographies we registered consecutive
frames to the first frame in each sequence. The homographies
when applied to the ground-truth bounding boxes can produce
inaccurate quadrilaterals since these transformations are on
a global frame level. All quadrilaterals were automatically
replaced with axis aligned boxes and visually inspected to
manually replace any incorrect bounding boxes, on registered
frames, with accurate axis aligned boxes using KOLAM [5],
[6], [47] or MIT Layer Annotation Tool [48].
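
As a small illustration of the box replacement step, the sketch below maps the four corners of a ground-truth box through a 3x3 homography and returns the enclosing axis-aligned box. Using the enclosing box is an assumption (the exact replacement rule is not stated), and the function name and example matrices are hypothetical.

```python
def warp_box_axis_aligned(H, box):
    """Map box corners through a 3x3 homography (nested lists) and return
    the enclosing axis-aligned box (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    xs, ys = [], []
    for x, y in [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]:
        w = H[2][0] * x + H[2][1] * y + H[2][2]          # projective scale
        xs.append((H[0][0] * x + H[0][1] * y + H[0][2]) / w)
        ys.append((H[1][0] * x + H[1][1] * y + H[1][2]) / w)
    return (min(xs), min(ys), max(xs), max(ys))


# A pure translation keeps the box shape; a rotation swaps its width and height.
shift = [[1, 0, 5], [0, 1, -3], [0, 0, 1]]
rot90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
shifted_box = warp_box_axis_aligned(shift, (0, 0, 10, 20))
rotated_box = warp_box_axis_aligned(rot90, (0, 0, 10, 20))
```

For rotations that are not multiples of 90 degrees the enclosing box is larger than the vehicle, which is one reason a visual inspection pass such as the one described above remains useful.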


[Figure 9 image omitted; sequences shown: C0_3_0, C1_2_0, C1_4_0, C1_4_6, C2_4_1, C3_3_4, C4_1_0, C4_3_0, C4_4_1, C4_4_4, C5_1_4, C5_2_0, C5_3_7, C5_4_1]

Fig. 9. Distribution of occluded frames in the 14 CLIF sequences. Black: fully
occluded, Gray: partially occluded. Target is occluded in 22.4% of the frames.


TABLE I
Characteristics of the 14 CLIF sequences summarized from [11], showing
track length, vehicle target size and number of occluded frames. Image
frames are 2008 x 1336 pixels.

Seq. No   Challenges           Track Length   Target Size [pixel]   Occ. Fr.
C0_3_0    Occlusion            50             17x25                 17
C1_2_0    Occlusion            27             21x15                 2
C1_4_0    Occlusion            50             21x17                 21
C1_4_6    Occlusion            50             25x25                 15
C2_4_1    Occlusion            50             25x17                 32
C3_3_4    Occlusion            27             27x17                 12
C4_1_0    Turning car          18             15x25                 -
C4_3_0    Occlusion            20             21x17                 3
C4_4_1    Low contrast         30             17x21                 -
C4_4_4    -                    13             17x25                 -
C5_1_4    Fast target motion   23             27x11                 -
C5_2_0    Fast target motion   49             21x15                 -
C5_3_7    -                    27             27x47                 -
C5_4_1    Low contrast         21             27x19                 -
Total                          455                                  102

C.  Quantitative  Comparison 

We used several retrieval or detection-based performance
metrics to evaluate the trackers. The first one is the Missing
Frame Rate (MFR), which is the ratio of the number of
missing frames to the total number of ground-truth frames,

$$ \mathrm{MFR} = \frac{\#\ \text{missing frames}}{\#\ \text{total GT frames}} \qquad (7) $$

A frame is labeled as missing if the detected/estimated
object location with associated bounding box overlaps with
the ground-truth bounding box by less than 1% or there is
no estimated bounding box at all (i.e., due to early track
termination). The one percent overlap threshold is the correct
one that was actually used in the CLIF experiments reported in
Ling et al. [11], not 50% (personal communication). We used
bounding boxes of roughly the same size as the target at the
predicted location; note that MFR does not explicitly penalize
the use of large bounding boxes.

Two commonly used criteria are precision and recall scores
for the tracker detected/estimated (single) target locations [50].
Precision (related to track purity) is defined as the ratio of the
number of correctly tracked frames, $|TP|$, to the total number of
tracked frames or track length,

$$ \text{Precision} = \frac{\#\ \text{correct frames}}{\#\ \text{tracked frames}} = \frac{|TP|}{|TP| + |FP|} \qquad (8) $$

where the correct frames are those in which target
locations are within a set threshold distance from the
ground-truth (i.e., a 20 pixel radius ribbon). Recall (related to target
purity) is the ratio of the number of correctly tracked frames to
the number of ground-truth frames for the target, defined as,

$$ \text{Recall} = \frac{\#\ \text{correct frames}}{\#\ \text{total GT frames}} = \frac{|TP|}{|TP| + |FN|} \approx 1 - \mathrm{MFR}. \qquad (9) $$

The equality is approximate since MFR uses a bounding box
overlap criterion whereas precision and recall use a distance
from ground-truth centroid metric. The final performance
metric used for evaluating trackers is the detected position error,
defined as the distance between the estimated object
position and the ground-truth centroid. Track-based completeness,
fragmentation, mean track length, id switches, gaps and other
measures of multi-target tracking performance are necessary
for a more thorough evaluation of tracking performance [13].
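
The three frame-level scores in Eqs. 7 to 9 can be computed as in the sketch below. Two points are assumptions: the overlap measure is taken to be intersection-over-union (the text only says the boxes overlap "by less than 1%"), and a frame with no reported box is represented by `None`. The function names and toy boxes are hypothetical.

```python
import math


def iou(a, b):
    """Intersection-over-union of boxes given as (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def centroid(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def evaluate_track(est_boxes, gt_boxes, min_overlap=0.01, max_dist=20.0):
    """Return (MFR, precision, recall) for one track.

    est_boxes[k] is None when the tracker reported nothing for frame k,
    for example after early termination.
    """
    total = len(gt_boxes)
    tracked = sum(1 for e in est_boxes if e is not None)
    missing = correct = 0
    for est, gt in zip(est_boxes, gt_boxes):
        if est is None or iou(est, gt) < min_overlap:
            missing += 1                                   # counts toward Eq. 7
        if est is not None and math.dist(centroid(est), centroid(gt)) <= max_dist:
            correct += 1                                   # counts toward Eqs. 8, 9
    precision = correct / tracked if tracked else 0.0
    return missing / total, precision, correct / total


# Four ground-truth frames: exact hit, near hit, far miss, terminated track.
gt = [(0, 0, 10, 10)] * 4
est = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110), None]
mfr, precision, recall = evaluate_track(est, gt)
```

In this toy track the terminated frame lowers recall but not precision, which mirrors the role of the termination module in keeping precision high.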

Some  LoFT  (vl.3)  modules  (in  Figure  1)  were  turned  off  for 
the  experiments  including  binary  classifier,  background  sub¬ 
traction,  and  MHT  in  order  to  focus  on  evaluating  the  appear¬ 
ance  update  performance.  LoFT  performance  was  compared  to 
several  state-of-the-art  trackers.  Table  II  summarizes  the  MFR 
scores  of  eight  trackers  on  CLIF  data.  Table  III  shows  overall 
precision-recall  scores  for  five  of  the  trackers  on  the  14  CLIF 
sequences;  we  used  author  provided  source  code  for  Nearest- 
Neighbor  (NN)  [20],  Ll-BPR  Sparse  Tracker  [25],  Multiple 
Instance  Learning  (MILTrack)  [21],  and  P-N  Tracker  [22]. 
We  did  some  limited  parameter  tuning  for  optimizing  each 
tracker  for  both  CLIF  and  FMV  separately.  Figure  10  shows 
position  errors  three  sample  CLIF  sequences  where  LoFT 
does  particularly  well.  These  comparisons  show  that  our  LoFT 
tracker  outperforms  all  other  trackers  on  this  CLIF  dataset. 
According  to  the  MFR  scores,  MILTrack  and  Ll-BPR  Sparse 
trackers  produced  results  comparable  to  LoFT  for  some  of  the 
sequences,  however,  the  lack  of  a  termination  module  causes 
their  precision  scores  to  drop  significantly  in  Table  III.  The  P- 
N  tracker  has  very  good  performance  on  FMV,  but  the  search 
method  involves  scanning  the  entire  image  and  thus  testing  on 


Method          Precision  Recall
L1-BPR [25]     0.185      0.185
MILTrack [21]   0.271      0.271
P-N [22]        0.373      0.172
NN [20]         0.088      0.082
LoFT            0.603      0.405

TABLE III

Overall precision-recall scores across 14 CLIF sequences.
Second best performance underlined.


Fig. 10. Position error over the entire sequence in pixels versus frame index
for five of the trackers on three selected CLIF sequences (C1_4_6, C4_4_1
and C5_3_7) for which LoFT has a high accuracy.

WAMI posed severe memory constraints. The P-N tracker has the
second highest precision on the CLIF data. The NN tracker had
the worst results on CLIF WAMI, likely due to the need to tune
the SIFT features. Figure 11 shows some visual trajectories of
tracking results where LoFT does well. Two sequences where
LoFT did not do well are C3_3_4, which is challenging for all
of the trackers, and C4_1_0, which has many nearby spatial
and temporal distractors while turning; see Figure 8 for the
visual appearance of these targets and environments.

Since most published trackers are designed for standard
FMV sequences, we also evaluated LoFT on several popular
benchmark videos with very different scene content and
characteristics compared to WAMI. Table IV shows the mean
distance error to ground-truth for eight published trackers
including LoFT, on four standard FMV sequences across
all frames of each sequence. The PROST, AdaBoost and
FragTrack results are from Gu et al. [20]. Figure 12 shows


Fig. 11. LoFT results (red tracks) for six sequences. Top: C0_3_0, C1_4_6,
Middle: C2_4_1, C4_4_1, Bottom: C1_2_0, C4_4_4, showing enhanced
images with ground-truth tracks in yellow. LoFT outperforms other trackers
in this set of sequences.
CLIF Seq.  MIL [21]  MS [51]  CPF [52]  HPF [53]  L1-BPR [11]  NN [20]  PN [22]  LoFT
C0_3_0     0.860     0.980    0.940     0.980     0.760        0.920    0.940    0.760
C1_2_0     0.852     0.963    0.9636    0.963     0.630        0.962    0.962    0.074
C1_4_0     0.680     0.780    0.740     0.760     0.620        1.000    0.700    0.820
C1_4_6     0.360     0.940    0.800     0.880     0.360        0.980    0.560    0.040
C2_4_1     0.900     0.980    0.980     0.980     0.920        0.980    0.980    0.440
C3_3_4     0.963     0.963    0.963     0.963     0.704        0.962    0.962    0.889
C4_1_0     0.389     0.889    0.833     0.889     0.389        0.888    0.944    1.000
C4_3_0     0.650     0.950    0.950     0.800     0.750        0.947    -        0.100
C4_4_1     0.533     0.967    0.900     0.900     0.033        0.931    0.758    0.035
C4_4_4     0.000     0.923    0.385     0.307     0.000        0.076    0.923    0.000
C5_1_4     0.667     0.958    0.875     0.833     0.667        0.958    0.958    0.792
C5_2_0     0.918     0.979    0.959     0.979     0.979        0.979    -        0.918
C5_3_7     0.000     0.963    0.148     0.000     0.000        0.8516   0.259    0.037
C5_4_1     0.000     0.952    0.810     0.905     0.958        0.523    0.809    0.000
Mean       0.555     0.942    0.803     0.796     0.555        0.854    0.813    0.422
OverAll    0.627     0.940    0.833     0.837     0.611        0.909    0.680    0.473

TABLE II

Missing frame rate (MFR) on CLIF WAMI (lower is better). Results for Multiple Instance Learning Tracker (MIL), Mean Shift
tracker (MS), Covariance Based Particle Filter (CPF) tracker, Histogram-based Particle Filter (HPF) tracker and L1-Bounded
Particle Resampling (L1-BPR or Sparse) tracker are from Ling et al. [11]. Mean indicates average of sequence MFRs (shorter
tracks have higher influence) while OverAll is an ensemble average as in [11].
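
The difference between the Mean and OverAll rows of Table II can be sketched as follows. The first function averages the per-sequence MFRs, and applying it to the LoFT column of Table II reproduces the reported Mean of 0.422. The second function assumes that the ensemble average pools missed frames and total frames across all sequences before dividing; that reading, the function names, and the frame counts in the second example are assumptions for illustration, since per-sequence frame counts are not listed in this section.

```python
def mean_of_sequence_mfrs(sequence_mfrs):
    """Unweighted average of per-sequence MFRs (the 'Mean' row).

    Every sequence counts equally, so each frame of a short track
    has more influence than each frame of a long track.
    """
    return sum(sequence_mfrs) / len(sequence_mfrs)


def pooled_mfr(missed_frames, total_frames):
    """Frame-pooled MFR: total missed frames over total frames.

    Assumed here to correspond to the 'OverAll' ensemble average.
    """
    return sum(missed_frames) / sum(total_frames)


# LoFT column of Table II, rows C0_3_0 through C5_4_1.
LOFT_MFR = [0.760, 0.074, 0.820, 0.040, 0.440, 0.889, 1.000,
            0.100, 0.035, 0.000, 0.792, 0.918, 0.037, 0.000]
print(round(mean_of_sequence_mfrs(LOFT_MFR), 3))

# Hypothetical counts: a short track that is mostly lost and a
# long track that is mostly followed.
missed = [9, 10]
total = [10, 100]
per_sequence = [m / t for m, t in zip(missed, total)]
print(mean_of_sequence_mfrs(per_sequence))  # dominated by the short track
print(pooled_mfr(missed, total))            # dominated by the long track
```

In the hypothetical case the two averages differ widely, which is why Table II reports both.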


Sequence  PROST [54]  AdaBoost [55]  FragTrack [56]  MILTrack [21]  NN [20]  L1-BPR [24]  PN [22]  LoFT
Girl      19.00       43.30          26.50           31.60          18.00    67.84        28.88    13.86
David     15.30       51.00          46.00           15.60          15.60    63.12        10.38    40.60
Faceocc   7.00        49.00          6.50            18.40          10.00    20.78        13.99    10.79
Faceocc2  17.20       19.60          45.10           14.30          12.90    73.27        19.14    13.25

TABLE IV

Mean position errors on standard full motion videos with PROST, AdaBoost, FragTrack, MILTrack and NN results from Gu et
al. [20]. Best results and second best results are shown in bold and underlined respectively.


Fig. 12. Tracking results on sample frames from the 'girl' and 'faceocc2'
sequences, with bounding boxes for ground-truth (yellow) and LoFT (red).

sample frames from LoFT tracking results compared to GT
for the 'girl' and 'faceocc2' sequences. Instead of tight initial
bounding boxes we used the actual GT bounding box on the
appropriate start frame in each FMV sequence. Based on the
mean distance errors, the LoFT system is comparable to the
other trackers on these four representative FMV sequences.
On the two videos in Figure 12, LoFT has the lowest mean error
on 'girl' and the second lowest on 'faceocc2' (Table IV).

VI.  Conclusions 

We described our Likelihood of Features Tracking (LoFT)
system developed to track vehicles of interest in challenging
low frame rate aerial WAMI, within a single target tracking
context. LoFT uses an adaptive set of feature descriptors with
posterior fusion modeled as recognition-based track-before-detect,
a novel appearance and pose estimation algorithm,
coupled with a track management module to achieve much
better performance compared to other state-of-the-art trackers
including L1-BPR, a sparse representation-based tracking
approach also adapted for WAMI, and learning-based tracking
algorithms like MILTrack and P-N Tracker. On the CLIF
dataset LoFT improves on the best previous results by 13.8
percentage points (L1-BPR) and 15.4 percentage points (MILTrack)
using the OverAll MFR metric. In terms of precision and recall
LoFT was 23.0 and 13.4 percentage points higher respectively,
compared to the second best trackers. On
FMV data LoFT performs quite competitively with the best
trackers in the literature and parameters were not customized
or tuned specifically for FMV in these preliminary tests. The
versatility of our approach for a range of tracking tasks was
demonstrated by LoFT's competitive performance on both
WAMI and FMV sequences with very different
foreground-background characteristics, camera geometry and frame rates.
LoFT is not restricted to single target tracking and can
be readily extended to multi-target tracking using multiple
trackers running in parallel in a supervised manner, or given
a collection of detections or moving target indicators, for
unsupervised automatic tracking.
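
The improvement margins quoted in the conclusions can be checked against Tables II and III. The short sketch below treats each margin as an absolute difference between scores, expressed in percentage points, which reproduces the quoted figures. The relative reduction in the last line is shown only for contrast and is not a number reported in the paper; the helper names are hypothetical.

```python
# OverAll MFR from Table II (lower is better).
OVERALL_MFR = {"L1-BPR": 0.611, "MIL": 0.627, "LoFT": 0.473}

# Overall precision and recall from Table III (higher is better).
PRECISION = {"L1-BPR": 0.185, "MILTrack": 0.271, "P-N": 0.373,
             "NN": 0.088, "LoFT": 0.603}
RECALL = {"L1-BPR": 0.185, "MILTrack": 0.271, "P-N": 0.172,
          "NN": 0.082, "LoFT": 0.405}


def points(a, b):
    """Absolute difference a - b, in percentage points."""
    return round((a - b) * 100, 1)


def second_best(scores, exclude="LoFT"):
    """Highest score among all trackers other than `exclude`."""
    return max(v for k, v in scores.items() if k != exclude)


mfr_vs_l1bpr = points(OVERALL_MFR["L1-BPR"], OVERALL_MFR["LoFT"])
mfr_vs_mil = points(OVERALL_MFR["MIL"], OVERALL_MFR["LoFT"])
precision_gain = points(PRECISION["LoFT"], second_best(PRECISION))
recall_gain = points(RECALL["LoFT"], second_best(RECALL))
print(mfr_vs_l1bpr, mfr_vs_mil, precision_gain, recall_gain)

# For contrast only: the relative MFR reduction versus L1-BPR.
relative = round(
    (OVERALL_MFR["L1-BPR"] - OVERALL_MFR["LoFT"]) / OVERALL_MFR["L1-BPR"] * 100, 1
)
print(relative)
```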